Tutorial 0: Data Preparation

Welcome to the Cell2Sentence (C2S) tutorial series! In this first tutorial, we will guide you through the process of preparing and preprocessing a single-cell RNA sequencing dataset for use with C2S. This step is important because the processed data produced in this notebook will be used in all subsequent tutorials, so make sure to complete this notebook first before moving on to the others if you are using the example data.

If you are processing your own dataset, keep a few things in mind:

  1. This notebook assumes you are starting from a single-cell dataset with raw transcript counts. You can follow these steps to preprocess your dataset for use with C2S.
  2. If you are starting from pre-normalized data, keep in mind that C2S normalization consists of the standard filtering & count normalization steps outlined by Scanpy, with the exception that the log1p transformation is applied with base 10. For best results with the inverse reconstruction (going from cell sentences back to expression, shown in tutorial 1), make sure to use base-10 normalized data as shown in this tutorial.
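As a quick illustration of why base 10 matters for reconstruction, the base-10 log1p transformation is exactly invertible via `10**y - 1` (the values below are made up for illustration, not from the dataset):

```python
import numpy as np

# Hypothetical normalized expression values (illustrative only)
expr = np.array([0.0, 1.0, 9.0, 99.0, 2500.0])

# Base-10 log1p, i.e. log10(1 + x) -- equivalent to sc.pp.log1p(adata, base=10)
log_expr = np.log1p(expr) / np.log(10)

# Exact inverse: 10**y - 1 recovers the original values
recovered = 10 ** log_expr - 1
assert np.allclose(recovered, expr)
```

This invertibility is what the linear reconstruction model in tutorial 1 relies on, so data normalized with the natural log will not round-trip correctly.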

We will be working with a subset of data from the Immune System tissue dataset by Domínguez Conde et al. This dataset provides an excellent foundation for demonstrating how to convert raw single-cell data into a format that can be used with C2S models.

  • Dataset Information:
    • Source: Immune System tissue dataset from Domínguez Conde et al. (2022)
    • Link: Science Article
    • Citation: Domínguez Conde, C., et al. "Cross-tissue immune cell analysis reveals tissue-specific features in humans." Science 376.6594 (2022): eabl5197.
  • Subset Used: We use data from two donors (A29 and A31), totaling 29,773 cells. This subset is chosen to keep the tutorial data manageable.

First, we will import the necessary libraries. These include general-purpose Python libraries, as well as specialized libraries for single-cell RNA sequencing data analysis.

In [1]:
from __future__ import annotations  # postponed evaluation of annotations (avoids some NameError issues)
import os
import numpy as np
import random
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)

import scanpy as sc
import anndata as ad
from collections import Counter  # for quick count tables of metadata columns

sc.set_figure_params(dpi=300, color_map="viridis_r", facecolor="white")
sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
Out[1]:
| Package    | Version |
| ---------- | ------- |
| numpy      | 2.3.3   |
| pandas     | 2.3.3   |
| matplotlib | 3.10.6  |
| seaborn    | 0.13.2  |
| scanpy     | 1.11.4  |
| anndata    | 0.12.2  |

| Dependency         | Version     |
| ------------------ | ----------- |
| ipython            | 9.6.0       |
| defusedxml         | 0.7.1       |
| python-dateutil    | 2.9.0.post0 |
| prompt_toolkit     | 3.0.52      |
| donfig             | 0.8.1.post1 |
| packaging          | 25.0        |
| matplotlib-inline  | 0.1.7       |
| pyzmq              | 27.1.0      |
| pillow             | 11.3.0      |
| texttable          | 1.7.0       |
| comm               | 0.2.3       |
| debugpy            | 1.8.17      |
| session-info2      | 0.2.2       |
| scikit-learn       | 1.7.2       |
| pyparsing          | 3.2.5       |
| Pygments           | 2.19.2      |
| numba              | 0.62.1      |
| kiwisolver         | 1.4.9       |
| typing_extensions  | 4.15.0      |
| msgpack            | 1.1.1       |
| six                | 1.17.0      |
| psutil             | 7.1.0       |
| jupyter_client     | 8.6.3       |
| asttokens          | 3.0.0       |
| pytz               | 2025.2      |
| tornado            | 6.5.2       |
| jedi               | 0.19.2      |
| cycler             | 0.12.1      |
| threadpoolctl      | 3.6.0       |
| stack_data         | 0.6.3       |
| scipy              | 1.16.2      |
| legacy-api-wrap    | 1.4.1       |
| PyYAML             | 6.0.3       |
| natsort            | 8.4.0       |
| executing          | 2.2.1       |
| h5py               | 3.14.0      |
| patsy              | 1.0.1       |
| joblib             | 1.5.2       |
| pure_eval          | 0.2.3       |
| traitlets          | 5.14.3      |
| parso              | 0.8.5       |
| colorama           | 0.4.6       |
| zarr               | 3.1.3       |
| jupyter_core       | 5.8.1       |
| numcodecs          | 0.16.1      |
| platformdirs       | 4.4.0       |
| setuptools         | 78.1.1      |
| statsmodels        | 0.14.5      |
| llvmlite           | 0.45.1      |
| leidenalg          | 0.10.2      |
| igraph             | 0.11.9      |
| charset-normalizer | 3.4.3       |
| crc32c             | 2.7.1       |
| ipykernel          | 6.30.1      |
| ipywidgets         | 8.1.5       |
| decorator          | 5.2.1       |
| wcwidth            | 0.2.14      |

| Component | Info                                                                           |
| --------- | ------------------------------------------------------------------------------ |
| Python    | 3.11.13 | packaged by conda-forge | (main, Jun  4 2025, 14:48:23) [GCC 13.3.0] |
| OS        | Linux-5.14.0-503.22.1.el9_5.x86_64-x86_64-with-glibc2.34                       |
| CPU       | 64 logical CPU cores, x86_64                                                   |
| GPU       | No GPU found                                                                   |
| Updated   | 2025-11-11 15:42                                                               |

The random seed was already set in the imports cell above. It controls any random operations in the code, ensuring that they produce the same output each time you run the notebook; the commented lines below are kept only as a reminder of the seed-setting calls.

In [3]:
# SEED = 1234
# random.seed(SEED)
# np.random.seed(SEED)

Load Data

Next, we will load the dataset. The dataset is stored in the AnnData format, which is commonly used for single-cell RNA sequencing data (scRNAseq). This format efficiently handles large datasets and provides useful functions for analysis.

We will use a dataset containing two donor samples from the Domínguez Conde et al. study. Here is a Google Drive link where you can download the data:

  • https://drive.google.com/file/d/1k5OoDVzk7CusHWEUZO7ENPbMZYjx-p0l/view?usp=sharing

Make sure you have downloaded this file (H5AD format) from our Google Drive link and set the correct file path before running the following code.

In [1]:
# Update this path to where you saved the downloaded H5AD file
DATA_PATH = "/ix/ccdg/storage3/til177/ParkLab/Project_C2S_scFM/code/tutorials/dominguez_conde_immune_tissue_two_donors.h5ad"
In [5]:
# read the data
adata = ad.read_h5ad(DATA_PATH)
adata
Out[5]:
AnnData object with n_obs × n_vars = 29773 × 36503
    obs: 'cell_type', 'tissue', 'batch_condition', 'organism', 'assay', 'sex'
    var: 'gene_name', 'ensembl_id'
In [25]:
print(Counter(adata.obs["batch_condition"]))

# Check the data structure
# adata.X     # main data matrix
print(adata.X.shape)
display(adata.obs.head())   # observation meta-data (cells)
display(adata.var.head())   # variable meta-data (genes)
Counter({'A29': 17327, 'A31': 12446})
(29773, 36503)
|                                | cell_type                              | tissue | batch_condition | organism     | assay     | sex    |
| ------------------------------ | -------------------------------------- | ------ | --------------- | ------------ | --------- | ------ |
| Pan_T7935490_AAACCTGCAAATTGCC  | CD4-positive helper T cell             | ileum  | A29             | Homo sapiens | 10x 5' v1 | female |
| Pan_T7935490_AAACGGGCATCTGGTA  | CD8-positive, alpha-beta memory T cell | ileum  | A29             | Homo sapiens | 10x 5' v1 | female |
| Pan_T7935490_AAACGGGTCTTGCATT  | CD8-positive, alpha-beta memory T cell | ileum  | A29             | Homo sapiens | 10x 5' v1 | female |
| Pan_T7935490_AAAGCAATCATCGCTC  | CD8-positive, alpha-beta memory T cell | ileum  | A29             | Homo sapiens | 10x 5' v1 | female |
| Pan_T7935490_AAAGTAGCAGTCACTA  | gamma-delta T cell                     | ileum  | A29             | Homo sapiens | 10x 5' v1 | female |

|              | gene_name   | ensembl_id      |
| ------------ | ----------- | --------------- |
| MIR1302-2HG  | MIR1302-2HG | ENSG00000243485 |
| FAM138A      | FAM138A     | ENSG00000237613 |
| OR4F5        | OR4F5       | ENSG00000186092 |
| RP11-34P13   | RP11-34P13  | ENSG00000238009 |
| RP11-34P13-1 | RP11-34P13  | ENSG00000239945 |

~30k cells (2 donors), ~36k genes

The data contains multiple metadata annotations, such as cell type and tissue information. A few things to note as we look through the dataset attributes:

  • The assay is 10x 5' v1, i.e. 5'-capture chemistry (commonly paired with TCR/BCR sequencing).

We can see that we have cell type, tissue, and organism annotations in this dataset, as well as the cell counts from both donors in our immune tissue dataset.

In [26]:
Counter(adata.obs["tissue"])
Out[26]:
Counter({'thoracic lymph node': 7573,
         'spleen': 5794,
         'mesenteric lymph node': 5209,
         'lung': 4458,
         'liver': 2716,
         'bone marrow': 1773,
         'duodenum': 735,
         'ileum': 379,
         'thymus': 354,
         'caecum': 339,
         'transverse colon': 262,
         'sigmoid colon': 133,
         'omentum': 30,
         'skeletal muscle tissue': 18})
In [27]:
Counter(adata.obs["cell_type"])
Out[27]:
Counter({'naive thymus-derived CD4-positive, alpha-beta T cell': 4262,
         'memory B cell': 3584,
         'macrophage': 2404,
         'naive B cell': 2223,
         'regulatory T cell': 1582,
         'T follicular helper cell': 1430,
         'CD8-positive, alpha-beta memory T cell, CD45RO-positive': 1369,
         'naive thymus-derived CD8-positive, alpha-beta T cell': 1335,
         'mucosal invariant T cell': 1306,
         'effector memory CD4-positive, alpha-beta T cell': 1249,
         'effector memory CD8-positive, alpha-beta T cell, terminally differentiated': 1181,
         'alveolar macrophage': 1180,
         'classical monocyte': 1003,
         'plasma cell': 715,
         'CD8-positive, alpha-beta memory T cell': 579,
         'CD4-positive helper T cell': 537,
         'CD16-positive, CD56-dim natural killer cell, human': 498,
         'alpha-beta T cell': 486,
         'gamma-delta T cell': 481,
         'lymphocyte': 424,
         'CD16-negative, CD56-bright natural killer cell, human': 341,
         'animal cell': 303,
         'conventional dendritic cell': 251,
         'non-classical monocyte': 250,
         'germinal center B cell': 141,
         'erythroid lineage cell': 135,
         'group 3 innate lymphoid cell': 116,
         'mast cell': 88,
         'plasmablast': 79,
         'progenitor cell': 75,
         'dendritic cell, human': 58,
         'plasmacytoid dendritic cell': 40,
         'megakaryocyte': 39,
         'pro-B cell': 17,
         'precursor B cell': 12})

There are diverse immune cell types and subtypes within this data, which makes for interesting prediction and generation tasks later on!

We also have both gene names and Ensembl IDs in our data. The gene names are important, since they will make up the cell sentences we create later on, so we keep them as the adata.var_names of our AnnData object; the C2S code base functions will use var_names to build the cell sentences.

In [38]:
adata.var
Out[38]:
|                 | gene_name       | ensembl_id      |
| --------------- | --------------- | --------------- |
| MIR1302-2HG     | MIR1302-2HG     | ENSG00000243485 |
| FAM138A         | FAM138A         | ENSG00000237613 |
| OR4F5           | OR4F5           | ENSG00000186092 |
| RP11-34P13      | RP11-34P13      | ENSG00000238009 |
| RP11-34P13-1    | RP11-34P13      | ENSG00000239945 |
| ...             | ...             | ...             |
| ENSG00000277836 | ENSG00000277836 | ENSG00000277836 |
| ENSG00000278633 | ENSG00000278633 | ENSG00000278633 |
| ENSG00000276017 | ENSG00000276017 | ENSG00000276017 |
| ENSG00000278817 | ENSG00000278817 | ENSG00000278817 |
| ENSG00000277196 | ENSG00000277196 | ENSG00000277196 |

36503 rows × 2 columns

In [37]:
adata.var_keys()
adata.var_names
Out[37]:
Index(['MIR1302-2HG', 'FAM138A', 'OR4F5', 'RP11-34P13', 'RP11-34P13-1',
       'RP11-34P13-2', 'RP11-34P13-3', 'RP11-34P13-4', 'AP006222',
       'RP4-669L17',
       ...
       'ENSG00000274175', 'ENSG00000275869', 'ENSG00000273554',
       'ENSG00000278782', 'ENSG00000277761', 'ENSG00000277836',
       'ENSG00000278633', 'ENSG00000276017', 'ENSG00000278817',
       'ENSG00000277196'],
      dtype='object', length=36503)
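Note that the gene_name column can contain repeats (e.g. RP11-34P13 above), while var_names must be unique; the index in this dataset already carries -1, -2 suffixes. A minimal sketch of that disambiguation idea (a simplified analogue of AnnData's `var_names_make_unique`, not its exact algorithm):

```python
from collections import Counter

def make_unique(names):
    """Suffix repeated names with -1, -2, ... (first occurrence unchanged)."""
    seen = Counter()
    out = []
    for name in names:
        # First occurrence keeps its name; later ones get a numeric suffix
        out.append(f"{name}-{seen[name]}" if seen[name] else name)
        seen[name] += 1
    return out

print(make_unique(["RP11-34P13", "RP11-34P13", "CD3D", "RP11-34P13"]))
# -> ['RP11-34P13', 'RP11-34P13-1', 'CD3D', 'RP11-34P13-2']
```

If your own dataset has duplicate var_names, calling `adata.var_names_make_unique()` before creating cell sentences avoids ambiguity.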
In [39]:
adata.X
Out[39]:
<Compressed Sparse Row sparse matrix of dtype 'float32'
	with 48394900 stored elements and shape (29773, 36503)>
In [40]:
adata.X.data.shape  # .data is a 1D NumPy array storing all the non-zero values of the sparse matrix
Out[40]:
(48394900,)
In [41]:
adata.X.data[:10]  # we can see from the count values that we have raw counts
Out[41]:
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 6.], dtype=float32)

Looking at the transcript counts, we can see that we are starting from raw counts: the values are whole numbers, not the continuous values that normalization would produce.
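A quick programmatic version of this check, run here on a hypothetical slice of a sparse matrix's `.data` array: raw UMI counts are stored as floats but should all be whole numbers, whereas normalized data generally is not.

```python
import numpy as np

# Hypothetical non-zero values from a sparse count matrix (illustrative)
data = np.array([1., 1., 2., 6., 3., 1.], dtype=np.float32)

# Raw counts are integer-valued even when stored as float32
is_raw_counts = bool(np.all(data == np.round(data)))
print(is_raw_counts)  # True for raw counts
```

On real data you would run this on `adata.X.data` (or a slice of it, for speed) before deciding whether to normalize.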

Data preprocessing

Now we will preprocess and normalize the data, following standard processing pipelines (Scanpy reference tutorial: https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering.html).

Cell2Sentence only deviates from the standard preprocessing and normalization pipeline in that the log transformation uses base 10 rather than the natural logarithm. We start with simple filtering steps:

In [42]:
adata
Out[42]:
AnnData object with n_obs × n_vars = 29773 × 36503
    obs: 'cell_type', 'tissue', 'batch_condition', 'organism', 'assay', 'sex'
    var: 'gene_name', 'ensembl_id'
In [43]:
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
filtered out 12559 genes that are detected in less than 3 cells
In [44]:
adata
Out[44]:
AnnData object with n_obs × n_vars = 29773 × 23944
    obs: 'cell_type', 'tissue', 'batch_condition', 'organism', 'assay', 'sex', 'n_genes'
    var: 'gene_name', 'ensembl_id', 'n_cells'
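Under the hood, these filters simply threshold the number of expressed genes per cell and the number of expressing cells per gene. A toy NumPy sketch (thresholds here are illustrative, not the tutorial's min_genes=200 / min_cells=3):

```python
import numpy as np

# Toy count matrix: 4 cells x 5 genes (illustrative values)
X = np.array([
    [1, 0, 2, 0, 0],
    [0, 0, 0, 0, 1],
    [3, 1, 0, 2, 0],
    [1, 2, 1, 0, 0],
])
min_genes, min_cells = 2, 2

# sc.pp.filter_cells criterion: keep cells expressing >= min_genes genes
genes_per_cell = (X > 0).sum(axis=1)
X = X[genes_per_cell >= min_genes, :]

# sc.pp.filter_genes criterion: keep genes detected in >= min_cells cells
cells_per_gene = (X > 0).sum(axis=0)
X = X[:, cells_per_gene >= min_cells]
print(X.shape)  # (3, 3): one cell and two genes removed
```

Note that Scanpy applies the gene filter after the cell filter here, so gene detection counts reflect only the retained cells, just as in the toy example.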

We now can annotate mitochondrial gene counts in our cells, for additional potential filtering steps:

In [45]:
# annotate the group of mitochondrial genes as "mt"
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(
    adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True
)
In [46]:
sc.pl.violin(
    adata,
    ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
    jitter=0.4,
    multi_panel=True,
)
(Figure: violin plots of n_genes_by_counts, total_counts, and pct_counts_mt)

We have the option here to filter out cells using quality-control metrics such as mitochondrial counts, but following the guidance of the Scanpy tutorial (https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering.html#quality-control), we will stick with the lenient filtering from the previous cell (min_cells and min_genes only). This step can be revisited if you want to remove more cells based on QC metrics.

In [17]:
# adata = adata[adata.obs.n_genes_by_counts < 6000, :]
# adata = adata[adata.obs.pct_counts_mt < 50, :].copy()
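If you do decide to filter on mitochondrial content, pct_counts_mt is just the percentage of each cell's counts coming from MT- genes. A toy computation with made-up gene names and counts:

```python
import numpy as np

# Illustrative gene names and counts (2 cells x 4 genes)
gene_names = np.array(["MT-CO1", "MT-ND1", "CD3D", "MS4A1"])
counts = np.array([
    [5, 3, 10, 2],   # cell 1: 8 of 20 counts are mitochondrial
    [1, 0, 20, 9],   # cell 2: 1 of 30 counts is mitochondrial
])

# Mask mitochondrial genes, like adata.var_names.str.startswith("MT-")
mt_mask = np.char.startswith(gene_names, "MT-")

# Percent of each cell's total counts coming from MT- genes
pct_counts_mt = 100 * counts[:, mt_mask].sum(axis=1) / counts.sum(axis=1)
print(pct_counts_mt)  # [40.0, ~3.33]
```

A common (dataset-dependent) heuristic is to drop cells above some pct_counts_mt cutoff, as in the commented-out cell above.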

Normalization

Now, we normalize our data. We follow the typical count normalization and log1p transformation steps to normalize single-cell datasets for use with C2S. The only difference is that we perform the log1p transformation with base 10, rather than the default natural logarithm. We empirically found this base 10 log1p transformation to work well for inverse reconstruction of expression using a linear model, which will be shown in tutorial notebook 1.
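The two steps can be sketched in plain NumPy on a toy matrix (by default, sc.pp.normalize_total scales each cell to the median of the per-cell total counts):

```python
import numpy as np

# Toy count matrix: 3 cells x 3 genes (illustrative values)
X = np.array([[2., 2., 0.],
              [1., 0., 1.],
              [4., 4., 4.]])

# Count normalization: rescale each cell to the median total (here 4.0)
totals = X.sum(axis=1)                 # per-cell totals: [4, 2, 12]
target = np.median(totals)
X_norm = X / totals[:, None] * target  # every row now sums to `target`

# log1p transformation with base 10
X_log = np.log1p(X_norm) / np.log(10)

assert np.allclose(X_norm.sum(axis=1), target)
```

This mirrors what the Scanpy calls below do to `adata.X`, with the base-10 log being the one C2S-specific choice.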

In [47]:
# Count normalization
sc.pp.normalize_total(adata)
# Log1p transformation with base 10 - base 10 is important for the C2S transformation!
sc.pp.log1p(adata, base=10)  
normalizing counts per cell
    finished (0:00:05)
In [48]:
adata.X.max()
Out[48]:
np.float32(3.408124)

After the base-10 log1p transformation, the largest value in our count matrix is ~3.4; in other words, the most highly expressed gene in any cell sits around 10^3.4 - 1 ≈ 2,500 normalized counts, which is the expected scale for a base-10 log transformation.
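As a quick sanity check on that number (using an assumed, illustrative top expression value of ~2,500 normalized counts):

```python
import numpy as np

# log10(1 + 2500) should land near the observed adata.X.max() of ~3.4
approx_max = np.log1p(2500.0) / np.log(10)
print(approx_max)  # ≈ 3.4
```

Had we used the natural log instead, the maximum would be around ln(2501) ≈ 7.8, so the observed ~3.4 also confirms the base-10 transform was applied.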

Visualization

Now let's process our adata object for UMAP visualization of our dataset. We will run PCA, Scanpy's neighbors algorithm, and then the UMAP algorithm implemented by Scanpy.

In [49]:
sc.tl.pca(adata)
computing PCA
    with n_comps=50
    finished (0:00:42)
In [50]:
sc.pp.neighbors(adata)
computing neighbors
    using 'X_pca' with n_pcs = 50
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:01:39)
In [51]:
sc.tl.umap(adata)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm)
    'umap', UMAP parameters (adata.uns) (0:00:22)

We can visualize UMAPs of the processed data colored by donor, cell type, and tissue metadata.

In [52]:
sc.pl.umap(
    adata,
    color="batch_condition",
    size=8,
    title="Human Immune Tissue UMAP",
)
(Figure: UMAP of the processed data colored by donor / batch_condition)
In [55]:
sc.pl.umap(
    adata,
    color="cell_type",
    size=8,
    title="Human Immune Tissue UMAP",
)
(Figure: UMAP of the processed data colored by cell_type)
In [56]:
sc.pl.umap(
    adata,
    color="tissue",
    size=8,
    title="Human Immune Tissue UMAP",
)
(Figure: UMAP of the processed data colored by tissue)

Finally, we will save the dataset to disk. This processed dataset will be used in all subsequent tutorials, so keep this file path handy.

In [57]:
SAVE_PATH = "/ix/ccdg/storage3/til177/ParkLab/Project_C2S_scFM/code/tutorials/dominguez_conde_immune_tissue_two_donors_preprocessed_tutorial_0.h5ad"
In [58]:
# Uncomment the line below to write the processed dataset to disk
# adata.write_h5ad(SAVE_PATH)